Abstract

Factorization is of fundamental importance in the area of Probabilistic Graphical Models (PGMs). In this paper, we theoretically develop a novel mathematical concept, Co-occurrence Rate (CR), for factorizing PGMs. CR has three advantages: (1) CR provides a unified mathematical foundation for factorizing different types of PGMs. We show that Bayesian Network Factorization (BN-F), Conditional Random Field Factorization (CRF-F), Markov Random Field Factorization (MRF-F) and Refined Markov Random Field Factorization (RMRF-F) are all special cases of CR Factorization (CR-F); (2) CR has a simple probability definition and a clear intuitive interpretation. CR-F tells not only the scopes of the factors, but also the exact probability functions of these factors; (3) CR connects probability factorization and graph operations directly. The factorization process of CR-F can be visualized as applying a sequence of graph operations, including partition, merge, duplicate and condition, to a PGM graph. We further obtain an important result: by CR-F, on graphs with a tree-structured clique graph (TCG) the scopes of factors can be exactly over maximal cliques without any default configuration. This improves the results of (R)MRF-F, which need default configurations, and also indicates that (R)MRF-F, as special cases of CR-F, cannot always achieve the optimal results of CR-F.



1 Introduction 

Independence is an important kind of prior knowledge that can be used to simplify probabilistic models (PMs). PGMs are compact formalizations of independence relations among random variables, which use different types of graphs as their representations. The fundamental problem in the area of PGMs is to factorize high dimensional joint probabilities into small factors based on the independence relations among random variables. Learning and inference algorithms are based on the results of factorization.

Bayesian networks (BNs) are directed acyclic graphs. The conditional independence of BNs can be judged by the d-separation criteria (Pearl 1986). BN-F is based on the mathematical concept of conditional probability:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Pa(x_i, G)),$$

where $Pa(x_i, G)$ are all the parents of the node $x_i$ in the BN graph $G$.

Markov networks (MNs) are undirected graphs which can contain cycles. According to the Markov property, a set of nodes is independent of its non-adjacent nodes conditioned on its immediate neighbours, which are called its Markov Blanket (MB). MRF-F is based on the Hammersley-Clifford (HC) Theorem (Clifford 1990), which states that a joint probability over an MN can always be written as a product of functions over all maximal cliques:

$$P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{i=1}^{m} \phi_i(mc_i),$$

where $\{mc_1, mc_2, \ldots, mc_m\}$ are all the maximal cliques; $\{\phi_1, \phi_2, \ldots, \phi_m\}$ are potential functions over maximal cliques; and $Z$ is the partition function for normalization.

The HC Theorem can be proved in a constructive way (Cheung 2008) by defining a candidate potential function as:

$$f_i(c_i) = \prod_{s \in \mathcal{P}(c_i)} P(X_s = x_s, X_{G \setminus s} = 0)^{(-1)^{|c_i| - |s|}}, \tag{1}$$

$$P(x_1, x_2, \ldots, x_n) = \prod_{i} f_i(c_i), \tag{2}$$

where $\{c_1, c_2, \ldots, c_l\}$ are all cliques in $G$ including the empty clique $\emptyset$; $\mathcal{P}(c_i)$ is the power set of $c_i$ including the boundary cases $\emptyset$ and $c_i$; $|*|$ is the number of nodes in $*$; and $P(X_s = x_s, X_{G \setminus s} = 0)$ is the joint probability with $X_s$ set to the corresponding values $x_s$ and the remainder of the graph $X_{G \setminus s}$ set to default configuration values denoted as 0. If we group the cliques into maximal cliques, then the potential functions over maximal cliques are:

$$\phi_i(mc_i) = \prod_{c_j \text{ grouped into } mc_i} f_j(c_j) = \prod_{c_j \text{ grouped into } mc_i} \; \prod_{s \in \mathcal{P}(c_j)} P(X_s = x_s, X_{G \setminus s} = 0)^{(-1)^{|c_j| - |s|}}.$$

If we replace the potential function over cliques in Eqn.(2) with Eqn.(1) and apply the Markov property, a Refined MRF Factorization (RMRF-F) can be obtained, which can be represented as a factor graph (Abbeel et al. 2005):

$$\begin{aligned}
P(x_1, x_2, \ldots, x_n) &= \prod_{i} \prod_{s \in \mathcal{P}(c_i)} P(X_s = x_s, X_{G \setminus s} = 0)^{(-1)^{|c_i| - |s|}} \\
&= \prod_{i} \prod_{s \in \mathcal{P}(c_i)} \left[ P(X_s = x_s, X_{c_i \setminus s} = 0 \mid X_{G \setminus c_i} = 0) \, P(X_{G \setminus c_i} = 0) \right]^{(-1)^{|c_i| - |s|}} \\
&= \prod_{i=1}^{l} \prod_{s \in \mathcal{P}(c_i)} P(X_s = x_s, X_{c_i \setminus s} = 0 \mid X_{MB(c_i)} = 0)^{(-1)^{|c_i| - |s|}},
\end{aligned} \tag{3}$$

where $MB(c_i)$ is the Markov Blanket of $c_i$. Here, according to the Markov property, conditioning on $X_{G \setminus c_i}$ is equal to conditioning on $X_{MB(c_i)}$.

The scopes of the factors in MRF-F are in fact over all the variables, because of the default global configuration. The scopes of the factors in RMRF-F are $\{c_i \cup MB(c_i)\}$.



CRF-F (Lafferty et al. 2001) can be considered as a special MRF-F, which factorizes the conditional probability. The chain structured CRF-F can be written in non-exponential form as follows:

$$P(y_1, y_2, \ldots, y_n \mid X) = \prod_{i=1}^{n-1} \phi_i(y_i, y_{i+1}, X) \prod_{i=1}^{n} f_i(y_i, X) \tag{4}$$

The transition feature functions $\{\phi_i\}$ are defined over edges $\{(y_i, y_{i+1})\}$ conditioned by $X$, and the state feature functions $\{f_i\}$ are defined over nodes $\{y_i\}$ conditioned by $X$.

Several questions arise naturally: (i) Conditional probability is used to factorize directed graphs; is there an equivalent for undirected graphs? Intuitively, this equivalent should be symmetrical; (ii) What exactly are the $f_i(y_i, X)$ and $\phi_i(y_i, y_{i+1}, X)$ in Eqn.(4)? Could they be written as exact probability functions? In MRF, they are explained as "compatibility", but this vague intuition is far from a precise definition; (iii) Is there a unified mathematical foundation for all of these factorizations?



In this paper, we answer these questions by constructing a novel mathematical concept, co-occurrence rate (CR). As CR-F is directly based on the independence relations among random variables, it can be directly applied to different types of PGMs, as different types of PGMs are just different representations of independence relations. CR has a simple probability definition and a clear intuitive interpretation. More importantly, we show that BN-F, CRF-F, MRF-F and RMRF-F are all special cases of CR-F. Thus CR provides a unified mathematical foundation for factorizing PGMs. CR-F can tell us not only the scopes of the factors, but also the exact probability functions of these factors. In CR-F, each factorizing step corresponds to a graph operation. CR-F can be visualized as applying a sequence of graph operations, including partition, merge, duplicate and condition, to the PGM graph. As "Graphical models are a marriage between probability theory and graph theory" (Jordan 1998), the strong association between probability factorization and graph operations is a big advantage of CR-F. We also describe a systematic way to factorize TCG graphs into factors whose scopes are exactly over maximal cliques without any default configuration. This improves the results of (R)MRF-F and also indicates that (R)MRF-F, as special cases of CR-F, cannot always achieve the optimal results of CR-F.

The remainder of the paper is organized as follows: in Section 2, CR is developed. In Section 3, examples are given to demonstrate CR-F for different types of PGMs. We also show that BN-F and CRF-F are special cases of CR-F. In Section 4, we show that (R)MRF-F are special cases of CR-F. Section 5 gives a systematic way to factorize TCGs. Conclusion, discussion and future work follow in the last two Sections 6 and 7.



2 Development of CR 

In this section, we construct the novel mathematical concept co-occurrence rate (CR) upon the foundations of probability theory. The concept of CR was inspired by the Lenz-Ising model (Ising 1925).



2.1 Definition of CR 



CR between two events $A$ and $B$ is defined as:

$$CR(A, B) = \frac{P(A, B)}{P(A) P(B)},$$

where $P$ is probability. CR can be intuitively interpreted as the interaction between the occurrences of $A$ and $B$: (i) If $CR(A, B) = 1$, the occurrences of $A$ and $B$ are independent; (ii) If $CR(A, B) > 1$, the occurrences of $A$ and $B$ are attractive; (iii) If $0 \le CR(A, B) < 1$, the occurrences of $A$ and $B$ are repulsive.
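The three cases above can be illustrated with a minimal Python sketch. The function names `co_occurrence_rate` and `interpret` and the probabilities used are hypothetical, chosen only for illustration.

```python
def co_occurrence_rate(p_ab: float, p_a: float, p_b: float) -> float:
    """CR(A, B) = P(A, B) / (P(A) P(B)); both marginals must be positive."""
    if p_a <= 0 or p_b <= 0:
        raise ValueError("CR is undefined when a marginal probability is zero")
    return p_ab / (p_a * p_b)


def interpret(cr: float, tol: float = 1e-12) -> str:
    """Map a CR value to the three cases listed in the text."""
    if abs(cr - 1.0) <= tol:
        return "independent"
    return "attractive" if cr > 1.0 else "repulsive"


# Hypothetical numbers: P(A) = P(B) = 0.5 with three different joints.
print(interpret(co_occurrence_rate(0.25, 0.5, 0.5)))  # independent
print(interpret(co_occurrence_rate(0.40, 0.5, 0.5)))  # attractive
print(interpret(co_occurrence_rate(0.10, 0.5, 0.5)))  # repulsive
```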



CR for discrete random variables is defined as:

$$CR(x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n)}{P(x_1) P(x_2) \cdots P(x_n)}.$$

For continuous random variables, we use the probability density function $p$:

$$\begin{aligned}
CR(x_1, x_2, \ldots, x_n)
&= \lim_{\epsilon \downarrow 0} \frac{P(x_1 - \epsilon_1 \le X_1 \le x_1 + \epsilon_1, \ldots, x_n - \epsilon_n \le X_n \le x_n + \epsilon_n)}{P(x_1 - \epsilon_1 \le X_1 \le x_1 + \epsilon_1) \cdots P(x_n - \epsilon_n \le X_n \le x_n + \epsilon_n)} \\
&= \lim_{\epsilon \downarrow 0} \frac{\int_{x_1 - \epsilon_1}^{x_1 + \epsilon_1} \cdots \int_{x_n - \epsilon_n}^{x_n + \epsilon_n} p(x_1, \ldots, x_n) \, dx_1 \ldots dx_n}{\int_{x_1 - \epsilon_1}^{x_1 + \epsilon_1} p(x_1) \, dx_1 \cdots \int_{x_n - \epsilon_n}^{x_n + \epsilon_n} p(x_n) \, dx_n} \\
&= \lim_{\epsilon \downarrow 0} \frac{2\epsilon_1 \cdots 2\epsilon_n \, p(x_1, \ldots, x_n)}{2\epsilon_1 p(x_1) \cdots 2\epsilon_n p(x_n)}
= \frac{p(x_1, \ldots, x_n)}{p(x_1) \cdots p(x_n)}.
\end{aligned}$$

In the rest of this paper, we only discuss the discrete situation. It can be easily extended to continuous random variables.

From the definition of CR, the joint probability can be written as:

$$P(x_1, x_2, \ldots, x_n) = CR(x_1, x_2, \ldots, x_n) P(x_1) \cdots P(x_n) \tag{6}$$

So instead of factorizing the joint probability, we can first factorize its CR, and then replace the CR in Eqn.(6) with the factorized CR. If there is only one random variable:

$$CR(x) = \frac{P(x)}{P(x)} = 1 \tag{7}$$

This can be intuitively explained as: one event can happen independently by itself. But $CR(\emptyset) = \frac{P(\emptyset)}{P(\emptyset)}$ is undefined, as $P(\emptyset) = 0$.

A conditional probability can be written with CR functions:

$$P(x_1, x_2, \ldots, x_n \mid x) = \frac{P(x_1, x_2, \ldots, x_n, x)}{P(x)} = CR(x_1, x_2, \ldots, x_n, x) P(x_1) \cdots P(x_n) \tag{8}$$

Notice that $CR(A, B, C) = \frac{P(A, B, C)}{P(A) P(B) P(C)}$ is different from $CR(A, BC) = \frac{P(A, BC)}{P(A) P(BC)} = \frac{P(A, B, C)}{P(A) P(BC)}$. The first CR means the co-occurrence rate among three events $\{A, B, C\}$. In the second one there are only two events: $A$ and a joint event $BC$. But there is no such difference for $P$: $P(A, B, C) = P(A, BC) = P(ABC)$. This complies with the intuition of CR and $P$.

2.2 Definition of Conditional CR

The Conditional CR is defined as:

$$CR(x_1, \ldots, x_n \mid x) = \frac{P(x_1, \ldots, x_n \mid x)}{P(x_1 \mid x) \cdots P(x_n \mid x)}, \tag{9}$$

which is the co-occurrence rate of $\{x_1, \ldots, x_n\}$ conditioned by $x$. Then,

$$CR(x_1, \ldots, x_n \mid x) = \frac{P(x_1, \ldots, x_n, x) P(x)^n}{P(x) P(x_1, x) \cdots P(x_n, x)} = \frac{CR(x_1, \ldots, x_n, x)}{CR(x_1, x) \cdots CR(x_n, x)},$$

and we can get the following theorem, which allows the condition operation on the graph to deal with the incomplete graph as demonstrated in Section 3.4:

Condition Theorem

$$\begin{aligned}
CR(x_1, \ldots, x_n) &= \frac{P(x_1, \ldots, x_n)}{P(x_1) \cdots P(x_n)} = \frac{\sum_x P(x_1, \ldots, x_n, x)}{P(x_1) \cdots P(x_n)} \\
&= \frac{\sum_x CR(x_1, \ldots, x_n, x) P(x_1) \cdots P(x_n) P(x)}{P(x_1) \cdots P(x_n)} \\
&= \sum_x CR(x_1, \ldots, x_n \mid x) \, CR(x_1, x) \cdots CR(x_n, x) \, P(x).
\end{aligned} \tag{10}$$
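The Condition Theorem (Eqn.(10)) can be checked numerically. The sketch below is only an illustration on a random joint distribution over three binary variables; the helper names `random_joint`, `prob`, `cr` and `conditional_cr` are hypothetical and not part of the paper.

```python
import itertools
import math
import random


def random_joint(n_vars, seed=0):
    """Random strictly positive joint distribution over binary variables."""
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=n_vars))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups`; each group (a tuple of variable
    indices) is treated as one joint event."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


def conditional_cr(joint, idx, state, cond):
    """CR(x_idx | cond) following Eqn.(9)."""
    p_c = prob(joint, cond)
    num = prob(joint, {**{i: state[i] for i in idx}, **cond}) / p_c
    den = math.prod(prob(joint, {i: state[i], **cond}) / p_c for i in idx)
    return num / den


joint = random_joint(3)          # variables 0, 1 play x_1, x_2; variable 2 plays x
state = {0: 1, 1: 0}
lhs = cr(joint, [(0,), (1,)], state)
rhs = sum(
    conditional_cr(joint, (0, 1), state, {2: x})
    * cr(joint, [(0,), (2,)], {**state, 2: x})
    * cr(joint, [(1,), (2,)], {**state, 2: x})
    * prob(joint, {2: x})
    for x in (0, 1)
)
print(math.isclose(lhs, rhs))
```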

2.3 Commutative 



If we consider CR as an operation on a set of variables, then CR is commutative:

$$CR(x_{a(1)}, x_{a(2)}, \ldots, x_{a(n)}) = \frac{P(x_{a(1)}, x_{a(2)}, \ldots, x_{a(n)})}{P(x_{a(1)}) P(x_{a(2)}) \cdots P(x_{a(n)})} = \frac{P(x_{b(1)}, x_{b(2)}, \ldots, x_{b(n)})}{P(x_{b(1)}) P(x_{b(2)}) \cdots P(x_{b(n)})} = CR(x_{b(1)}, x_{b(2)}, \ldots, x_{b(n)}),$$

where $a$ and $b$ are different permutations of $(1, 2, \ldots, n)$. This commutative law is important because it allows us to partition or merge the graph in any way.

2.4 Marginal CR 

Random variables in CR can be eliminated by marginally summing up:

$$\begin{aligned}
\sum_{x_n} CR(x_1, x_2, \ldots, x_{n-1}, x_n) P(x_n)
&= \sum_{x_n} \frac{P(x_1, x_2, \ldots, x_{n-1}, x_n)}{P(x_1) P(x_2) \cdots P(x_{n-1}) P(x_n)} P(x_n) \\
&= \sum_{x_n} \frac{P(x_1, x_2, \ldots, x_{n-1}, x_n)}{P(x_1) P(x_2) \cdots P(x_{n-1})} \\
&= \frac{P(x_1, x_2, \ldots, x_{n-1})}{P(x_1) P(x_2) \cdots P(x_{n-1})} \\
&= CR(x_1, x_2, \ldots, x_{n-1}).
\end{aligned}$$

If $n = 2$:

$$\sum_{x_2} CR(x_1, x_2) P(x_2) = CR(x_1) = 1,$$

where $CR(x_1) = 1$ by Eqn.(7).
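As a quick numerical illustration of marginal CR, the following sketch sums out the last variable of a random three-variable distribution. The helpers `random_joint`, `prob` and `cr` are hypothetical names introduced only for this example.

```python
import itertools
import math
import random


def random_joint(n_vars, seed=0):
    """Random strictly positive joint distribution over binary variables."""
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=n_vars))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups` (tuples of variable indices)."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


joint = random_joint(3)
x = {0: 1, 1: 0}

# Sum over x_3 of CR(x_1, x_2, x_3) P(x_3) should equal CR(x_1, x_2).
summed = sum(cr(joint, [(0,), (1,), (2,)], {**x, 2: v}) * prob(joint, {2: v})
             for v in (0, 1))
direct = cr(joint, [(0,), (1,)], x)

# The n = 2 case: sum over x_2 of CR(x_1, x_2) P(x_2) should equal 1.
n2 = sum(cr(joint, [(0,), (1,)], {0: 1, 1: v}) * prob(joint, {1: v})
         for v in (0, 1))
print(math.isclose(summed, direct), math.isclose(n2, 1.0))
```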



2.5 Bi-partition Theorem 

This is the critical theorem which allows the bi-partition operation on the graph to factorize a CR into three parts (the left, the right and the cut between the left and right):

$$CR(x_1, \ldots, x_k, x_{k+1}, \ldots, x_n) = CR(x_1, \ldots, x_k) \, CR(x_{k+1}, \ldots, x_n) \, CR(x_1 \ldots x_k, \; x_{k+1} \ldots x_n) \tag{11}$$

This theorem can be proved as follows:

$$\begin{aligned}
& CR(x_1, \ldots, x_k) \, CR(x_{k+1}, \ldots, x_n) \, CR(x_1 \ldots x_k, \; x_{k+1} \ldots x_n) \\
&= \frac{P(x_1, \ldots, x_k)}{P(x_1) \cdots P(x_k)} \cdot \frac{P(x_{k+1}, \ldots, x_n)}{P(x_{k+1}) \cdots P(x_n)} \cdot \frac{P(x_1, \ldots, x_n)}{P(x_1, \ldots, x_k) P(x_{k+1}, \ldots, x_n)} \\
&= \frac{P(x_1, \ldots, x_n)}{P(x_1) P(x_2) \cdots P(x_n)} \\
&= CR(x_1, \ldots, x_k, x_{k+1}, \ldots, x_n).
\end{aligned}$$

The Bi-partition Theorem can be used recursively to further factorize the new CRs.
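The Bi-partition Theorem holds for any joint distribution, which the following sketch checks on a random distribution over four binary variables with the cut placed after the second variable. The helper names are hypothetical.

```python
import itertools
import math
import random


def random_joint(n_vars, seed=0):
    """Random strictly positive joint distribution over binary variables."""
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=n_vars))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups`; a group with several indices is one
    joint event, e.g. (0, 1) stands for x_1 x_2."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


joint = random_joint(4)
max_gap = 0.0
for values in itertools.product((0, 1), repeat=4):
    state = dict(enumerate(values))
    whole = cr(joint, [(0,), (1,), (2,), (3,)], state)
    parts = (cr(joint, [(0,), (1,)], state)          # left part
             * cr(joint, [(2,), (3,)], state)        # right part
             * cr(joint, [(0, 1), (2, 3)], state))   # cut between them
    max_gap = max(max_gap, abs(whole - parts))
print(max_gap < 1e-9)
```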

2.6 Merge Theorem 

This theorem allows the merge operation, which is the inverse of the partition operation:

$$CR(x_1, \ldots, x_k, x_{k+1}, \ldots, x_n) = CR(x_1, \ldots, x_k x_{k+1}, \ldots, x_n) \, CR(x_k, x_{k+1}) \tag{12}$$

where two subgraphs $x_k$ and $x_{k+1}$ are merged into one part $x_k x_{k+1}$ and a new factor $CR(x_k, x_{k+1})$ is generated. This theorem can be proved as:

$$\begin{aligned}
& CR(x_1, \ldots, x_k x_{k+1}, \ldots, x_n) \, CR(x_k, x_{k+1}) \\
&= \frac{P(x_1, \ldots, x_n)}{P(x_1) \cdots P(x_k x_{k+1}) \cdots P(x_n)} \cdot \frac{P(x_k x_{k+1})}{P(x_k) P(x_{k+1})} \\
&= \frac{P(x_1, \ldots, x_k, x_{k+1}, \ldots, x_n)}{P(x_1) \cdots P(x_k) P(x_{k+1}) \cdots P(x_n)} \\
&= CR(x_1, \ldots, x_k, x_{k+1}, \ldots, x_n).
\end{aligned}$$

A corollary follows directly from this Merge Theorem and the Independence Theorem (Eqn.(14)): if $(x_k \perp x_{k+1})$, then:

$$CR(x_1, \ldots, x_k, x_{k+1}, \ldots, x_n) = CR(x_1, \ldots, x_k x_{k+1}, \ldots, x_n).$$

That is, merging two independent random variables does not affect the global CR value.
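A small numerical check of the Merge Theorem, assuming three binary variables where the last two are merged into one joint event. The helper names are hypothetical and exist only for this sketch.

```python
import itertools
import math
import random


def random_joint(n_vars, seed=0):
    """Random strictly positive joint distribution over binary variables."""
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=n_vars))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups` (tuples of variable indices)."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


joint = random_joint(3)
max_gap = 0.0
for values in itertools.product((0, 1), repeat=3):
    state = dict(enumerate(values))
    unmerged = cr(joint, [(0,), (1,), (2,)], state)
    # Merge variables 1 and 2 into one part; CR(x_2, x_3) is the new factor.
    merged = cr(joint, [(0,), (1, 2)], state) * cr(joint, [(1,), (2,)], state)
    max_gap = max(max_gap, abs(unmerged - merged))
print(max_gap < 1e-9)
```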

2.7 Duplicate Theorem 

This theorem allows the duplicate operation, which duplicates a random variable that already exists in the CR. This theorem is very useful when we manipulate overlapping subgraphs:

$$CR(x_1, \ldots, x_i, \ldots, x_n) = CR(x_1, \ldots, x_i, x_i, \ldots, x_n) P(x_i). \tag{13}$$

This theorem can be proved as follows:

$$\begin{aligned}
CR(x_1, x_2, \ldots, x_i, x_i, \ldots, x_n) P(x_i)
&= \frac{P(x_1 x_2 \ldots x_n)}{P(x_1) P(x_2) \cdots P(x_i) P(x_i) \cdots P(x_n)} P(x_i) \\
&= \frac{P(x_1 x_2 \ldots x_n)}{P(x_1) P(x_2) \cdots P(x_i) \cdots P(x_n)} \\
&= CR(x_1, x_2, \ldots, x_i, \ldots, x_n).
\end{aligned}$$
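The Duplicate Theorem can be illustrated in the same style. In the sketch below, listing an index twice in `groups` models the duplicated variable; the helper names are hypothetical.

```python
import itertools
import math
import random


def random_joint(n_vars, seed=0):
    """Random strictly positive joint distribution over binary variables."""
    rng = random.Random(seed)
    states = list(itertools.product((0, 1), repeat=n_vars))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups`. A repeated index contributes its
    marginal once per occurrence in the denominator, while the numerator is
    the probability of the union of all events."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


joint = random_joint(2)
max_gap = 0.0
for values in itertools.product((0, 1), repeat=2):
    state = dict(enumerate(values))
    original = cr(joint, [(0,), (1,)], state)
    duplicated = cr(joint, [(0,), (1,), (1,)], state) * prob(joint, {1: state[1]})
    max_gap = max(max_gap, abs(original - duplicated))
print(max_gap < 1e-9)
```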



2.8 Independence Theorem 

If $\{x_1, x_2, \ldots, x_n\}$ are mutually independent:

$$CR(x_1, x_2, \ldots, x_n) = 1. \tag{14}$$

2.9 Conditional Independence Theorems 

2.9.1 The First CIT 

If $(x_1 x_2 \ldots x_k \perp y_1 y_2 \ldots y_l \mid w_1 w_2 \ldots w_m)$, then:

$$CR(x_1 x_2 \ldots x_k, \; y_1 y_2 \ldots y_l w_1 w_2 \ldots w_m) = CR(x_1 x_2 \ldots x_k, \; w_1 w_2 \ldots w_m). \tag{15}$$

This theorem is used to reduce the random variables after a partition or merge operation. This theorem can be proved as:

$$(x_1 x_2 \ldots x_k \perp y_1 y_2 \ldots y_l \mid w_1 w_2 \ldots w_m) \Rightarrow P(x_1 \ldots x_k y_1 \ldots y_l w_1 \ldots w_m) = \frac{P(x_1 \ldots x_k w_1 \ldots w_m) P(y_1 \ldots y_l w_1 \ldots w_m)}{P(w_1 \ldots w_m)},$$

then,

$$\begin{aligned}
CR(x_1 x_2 \ldots x_k, \; y_1 y_2 \ldots y_l w_1 w_2 \ldots w_m)
&= \frac{P(x_1 \ldots x_k y_1 \ldots y_l w_1 \ldots w_m)}{P(x_1 \ldots x_k) P(y_1 \ldots y_l w_1 \ldots w_m)} \\
&= \frac{P(x_1 \ldots x_k w_1 \ldots w_m)}{P(x_1 \ldots x_k) P(w_1 \ldots w_m)} \\
&= CR(x_1 x_2 \ldots x_k, \; w_1 w_2 \ldots w_m).
\end{aligned}$$

2.9.2 The Second CIT 

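If $(x_1 x_2 \ldots x_k \perp y_1 y_2 \ldots y_l \mid w_1 w_2 \ldots w_m)$, then:

$$CR(w_1 w_2 \ldots w_m, \; x_1 x_2 \ldots x_k y_1 y_2 \ldots y_l) = \frac{CR(x_1 x_2 \ldots x_k, \; w_1 w_2 \ldots w_m) \, CR(y_1 y_2 \ldots y_l, \; w_1 w_2 \ldots w_m)}{CR(x_1 x_2 \ldots x_k, \; y_1 y_2 \ldots y_l)}.$$

This theorem is useful because each CR on the right side has fewer random variables than the left CR. It can be proved as:

$$\begin{aligned}
CR(w_1 w_2 \ldots w_m, \; x_1 x_2 \ldots x_k y_1 y_2 \ldots y_l)
&= \frac{P(w_1 w_2 \ldots w_m x_1 x_2 \ldots x_k y_1 y_2 \ldots y_l)}{P(w_1 w_2 \ldots w_m) P(x_1 \ldots x_k y_1 \ldots y_l)} \\
&= \frac{P(w_1 \ldots w_m x_1 \ldots x_k) P(w_1 \ldots w_m y_1 \ldots y_l)}{P(w_1 \ldots w_m) P(w_1 \ldots w_m) P(x_1 \ldots x_k y_1 \ldots y_l)} \\
&= \frac{CR(x_1 x_2 \ldots x_k, \; w_1 w_2 \ldots w_m) \, CR(y_1 y_2 \ldots y_l, \; w_1 w_2 \ldots w_m)}{CR(x_1 x_2 \ldots x_k, \; y_1 y_2 \ldots y_l)}.
\end{aligned}$$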

2.9.3 The Third CIT 

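If $(x_1 x_2 \ldots x_k \perp y_1 y_2 \ldots y_l \mid w_1 w_2 \ldots w_m)$, then:

$$CR(w_1 \ldots w_m x_1 \ldots x_k, \; w_1 \ldots w_m y_1 \ldots y_l) = CR(w_1 \ldots w_m, \; w_1 \ldots w_m) = \frac{1}{P(w_1 w_2 \ldots w_m)}. \tag{16}$$

This theorem is useful when we deal with overlapping clusters. It can be proved as:

$$\begin{aligned}
CR(w_1 \ldots w_m x_1 \ldots x_k, \; w_1 \ldots w_m y_1 \ldots y_l)
&= \frac{P(w_1 \ldots w_m x_1 \ldots x_k y_1 \ldots y_l)}{P(w_1 \ldots w_m x_1 \ldots x_k) P(w_1 \ldots w_m y_1 \ldots y_l)} \\
&= \frac{1}{P(w_1 \ldots w_m)} = \frac{P(w_1 \ldots w_m)}{P(w_1 \ldots w_m) P(w_1 \ldots w_m)} \\
&= \frac{P(w_1 \ldots w_m, \; w_1 \ldots w_m)}{P(w_1 \ldots w_m) P(w_1 \ldots w_m)} \\
&= CR(w_1 \ldots w_m, \; w_1 \ldots w_m).
\end{aligned}$$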
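The three CITs can be checked on a distribution that is conditionally independent by construction. The sketch below builds $P(w) P(x \mid w) P(y \mid w)$ over binary variables, so that $x \perp y \mid w$ holds, and measures the largest deviation from each theorem. The helper names and the random parameters are assumptions made only for illustration.

```python
import itertools
import math
import random


def ci_joint(seed=0):
    """Joint over (w, x, y) built as P(w) P(x|w) P(y|w), so x is independent
    of y given w."""
    rng = random.Random(seed)

    def bern(p, v):
        return p if v == 1 else 1.0 - p

    p_w = rng.uniform(0.2, 0.8)
    p_x = [rng.uniform(0.1, 0.9) for _ in range(2)]  # P(x = 1 | w)
    p_y = [rng.uniform(0.1, 0.9) for _ in range(2)]  # P(y = 1 | w)
    return {(w, x, y): bern(p_w, w) * bern(p_x[w], x) * bern(p_y[w], y)
            for w, x, y in itertools.product((0, 1), repeat=3)}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups` (tuples of variable indices)."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


W, X, Y = 0, 1, 2
joint = ci_joint()
gaps = {"first": 0.0, "second": 0.0, "third": 0.0}
for values in itertools.product((0, 1), repeat=3):
    st = dict(enumerate(values))
    first = cr(joint, [(X,), (Y, W)], st) - cr(joint, [(X,), (W,)], st)
    second = cr(joint, [(W,), (X, Y)], st) - (
        cr(joint, [(X,), (W,)], st) * cr(joint, [(Y,), (W,)], st)
        / cr(joint, [(X,), (Y,)], st))
    third = cr(joint, [(W, X), (W, Y)], st) - 1.0 / prob(joint, {W: st[W]})
    for name, gap in (("first", first), ("second", second), ("third", third)):
        gaps[name] = max(gaps[name], abs(gap))
print(all(g < 1e-9 for g in gaps.values()))
```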

2.10 Unconnected Nodes Theorem (UNT) 

Suppose $\{a, b\}$ are two unconnected nodes in $G$, that is, there is no direct edge between $a$ and $b$. Then $a \perp b \mid MB(a, b)$, where $MB(a, b)$ is the Markov blanket of $\{a, b\}$. Suppose further that $W, X \in \mathcal{P}(G \setminus \{a, b\})$, including the boundary cases $\{\emptyset, G \setminus \{a, b\}\}$, with $MB(a, b) \subseteq W \cup X$ and $W \cap X = \emptyset$. Then $(a \perp b \mid W, X)$ and we get the UNT:

$$CR(W, a = 0, b = 0, X = 0) \, CR(W, a, b, X = 0) = CR(W, a = 0, b, X = 0) \, CR(W, a, b = 0, X = 0) \tag{17}$$

For the left side, we partition (Eqn.(11)) $a$ out and apply the first CIT (Eqn.(15)):

$$\begin{aligned}
& CR(W, a = 0, b = 0, X = 0) \, CR(W, a, b, X = 0) \\
&= CR(W, b = 0, X = 0) \, CR(a = 0, \; WX = 0) \, CR(W, b, X = 0) \, CR(a, \; WX = 0).
\end{aligned}$$

For the right side, we also partition $a$ out and apply the first CIT:

$$\begin{aligned}
& CR(W, a = 0, b, X = 0) \, CR(W, a, b = 0, X = 0) \\
&= CR(W, b, X = 0) \, CR(a = 0, \; WX = 0) \, CR(W, b = 0, X = 0) \, CR(a, \; WX = 0).
\end{aligned}$$

As the left side equals the right side, the theorem is proved.
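As an illustration of the UNT, the sketch below uses a hypothetical Markov network on the 4-cycle a - w - b - x - a, where a and b are unconnected and $MB(a, b) = \{w, x\}$. We take $W = \{w\}$ and $X = \{x\}$, with 0 as the default value. The pairwise potentials are random and all helper names are assumptions of this example.

```python
import itertools
import math
import random


def four_cycle_joint(seed=0):
    """Markov network on the 4-cycle a - w - b - x - a with random positive
    pairwise potentials; states are ordered (a, b, w, x)."""
    rng = random.Random(seed)

    def potential():
        return {k: rng.uniform(0.5, 2.0)
                for k in itertools.product((0, 1), repeat=2)}

    f_aw, f_wb, f_bx, f_xa = potential(), potential(), potential(), potential()
    weights = {
        (a, b, w, x): f_aw[(a, w)] * f_wb[(w, b)] * f_bx[(b, x)] * f_xa[(x, a)]
        for a, b, w, x in itertools.product((0, 1), repeat=4)
    }
    z = sum(weights.values())
    return {s: v / z for s, v in weights.items()}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr4(joint, a, b, w, x):
    """CR(W = w, a, b, X = x) among the four single variables."""
    values = (a, b, w, x)
    denom = math.prod(prob(joint, {i: v}) for i, v in enumerate(values))
    return joint[values] / denom


joint = four_cycle_joint()
max_gap = 0.0
for a, b, w in itertools.product((0, 1), repeat=3):
    lhs = cr4(joint, 0, 0, w, 0) * cr4(joint, a, b, w, 0)
    rhs = cr4(joint, 0, b, w, 0) * cr4(joint, a, 0, w, 0)
    max_gap = max(max_gap, abs(lhs - rhs))
print(max_gap < 1e-9)
```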

3 Examples 

In this section, we demonstrate CR-F on different PGMs based on the results obtained in Section 2.

3.1 Example 1: A Bayesian Network 




Figure 1: A BN (Koller & Friedman 2009)

Fig.1 is a Bayesian network. By Eqn.(6):

P(D, I, G, S, L)
= CR(D, I, G, S, L)P(D)P(I)P(G)P(S)P(L).



We go on to factorize CR(D, I, G, S, L). Factorization using CR applies a sequence of graph operations, including partition, merge, duplicate and condition, to the graph. After each operation, we check if the CITs in Section 2.9 can be applied to reduce random variables. As there are many such operation sequences, we can get many different factorization results. All of them are mathematically correct.^1 We illustrate two of them as follows:

Factorization 1 (by partition):

Step1: ({D, I, G, S, L}) -> ({D}, {I, G, S, L}).

CR(D, I, G, S, L) = CR(D)CR(I, G, S, L)CR(D, IGSL)
= CR(I, G, S, L)CR(D, G)

We get the first equation by the partition operation (Eqn.(11)). We get the second equation by the First CIT (Eqn.(15)) as (D ⊥ ISL | G), and CR(D) = 1 (Eqn.(7)).

Step2: ({I, G, S, L}) -> ({S}, {I, G, L}).

CR(I, G, S, L) = CR(I, G, L)CR(S, I)

Step3: ({I, G, L}) -> ({I}, {G, L}).

CR(I, G, L) = CR(I, G)CR(G, L)

Finally:

CR(D, I, G, S, L) = CR(D, G)CR(S, I)CR(I, G)CR(G, L)

Factorization 1 (by merge):

Step1: {D, I, G, S, L} -> {D, I, S, GL}.

CR(D, I, G, S, L) = CR(D, I, S, GL)CR(G, L)

We get this equation by the merge operation (Eqn.(12)).

Step2: {D, I, S, GL} -> {D, S, IGL}.

CR(D, I, S, GL) = CR(D, S, IGL)CR(I, GL)
= CR(D, S, IGL)CR(I, G)

Step3: {D, S, IGL} -> {S, DIGL}.

CR(D, S, IGL) = CR(S, DIGL)CR(D, IGL)
= CR(S, I)CR(D, G)

Finally:

CR(D, I, G, S, L) = CR(G, L)CR(I, G)CR(S, I)CR(D, G)
In the remainder of the paper, we only demonstrate factorization by partition. Factorization by merge can be easily obtained by merging the nodes in the reverse direction of factorization by partition.

^1 The logical consideration of the relation between BN-F and CR-F will be discussed in another paper.

Factorization 2:

Step1: ({D, I, G, S, L}) -> ({S}, {I, G, D, L}).

CR(D, I, G, S, L) = CR(I, G, D, L)CR(S, I).

Step2: ({I, G, D, L}) -> ({D, I}, {G, L}).

CR(I, G, D, L) = CR(D, I)CR(G, L)CR(DI, GL)
= CR(G, L)CR(DI, G)

We get the second equation as (D ⊥ I) (Eqn.(14)) and (DI ⊥ L | G).

Finally:

CR(D, I, G, S, L) = CR(S, I)CR(DI, G)CR(G, L)

If we group the CRs in the above equation into proper scopes, we can get the result of BN-F:

P(D, I, G, S, L)
= CR(S, I)CR(DI, G)CR(G, L)P(D)P(I)P(G)P(S)P(L)
= P(D)P(I)CR(DI, G)P(G)CR(S, I)P(S)CR(G, L)P(L)
= P(D)P(I)P(G|DI)P(S|I)P(L|G)

The factors in BN-F can be obtained by keeping all the parents of a node in the same part when we are partitioning the graph. So BN-F can be considered as a special case of CR-F.
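The grouping in Factorization 2 can be checked numerically. The sketch below builds a hypothetical version of the network D -> G <- I, I -> S, G -> L with binary variables and random conditional probability tables (the paper gives no numbers), then verifies that CR(S, I)CR(DI, G)CR(G, L) times the five marginals reproduces the joint probability. All names and parameters are assumptions of this example.

```python
import itertools
import math
import random


def student_bn(seed=0):
    """Joint P(D) P(I) P(G|D,I) P(S|I) P(L|G) with random binary CPTs;
    states are ordered (D, I, G, S, L)."""
    rng = random.Random(seed)

    def u():
        return rng.uniform(0.1, 0.9)

    def bern(p, v):
        return p if v == 1 else 1.0 - p

    p_d, p_i = u(), u()
    p_g = {(d, i): u() for d in (0, 1) for i in (0, 1)}  # P(G = 1 | D, I)
    p_s = {i: u() for i in (0, 1)}                       # P(S = 1 | I)
    p_l = {g: u() for g in (0, 1)}                       # P(L = 1 | G)
    return {
        (d, i, g, s, l): bern(p_d, d) * bern(p_i, i) * bern(p_g[(d, i)], g)
        * bern(p_s[i], s) * bern(p_l[g], l)
        for d, i, g, s, l in itertools.product((0, 1), repeat=5)
    }


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups` (tuples of variable indices)."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


D, I, G, S, L = range(5)
joint = student_bn()
max_gap = 0.0
for values in itertools.product((0, 1), repeat=5):
    st = dict(enumerate(values))
    rebuilt = (cr(joint, [(S,), (I,)], st)
               * cr(joint, [(D, I), (G,)], st)
               * cr(joint, [(G,), (L,)], st)
               * math.prod(prob(joint, {v: st[v]}) for v in range(5)))
    max_gap = max(max_gap, abs(rebuilt - joint[values]))
print(max_gap < 1e-9)
```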

3.2 Example 2: Tree-Structured Markov 
Network 




Figure 2: A Tree-Structured Markov Network 

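The tree-structured Markov network can be factorized by partitioning one leaf out each time. This results in factors over all the edges and nodes:

$$\begin{aligned}
P(y_1, y_2, \ldots, y_n, x_1, x_2, \ldots, x_n)
&= CR(y_1, y_2, \ldots, y_n, x_1, x_2, \ldots, x_n) \prod_{i=1}^{n} P(y_i) \prod_{i=1}^{n} P(x_i) \\
&= \prod_{i=1}^{n} CR(x_i, y_i) \prod_{i=2}^{n} CR(y_{i-1}, y_i) \prod_{i=1}^{n} P(y_i) \prod_{i=1}^{n} P(x_i) \\
&= \prod_{i=1}^{n} \frac{P(x_i, y_i)}{P(x_i) P(y_i)} \prod_{i=2}^{n} \frac{P(y_{i-1}, y_i)}{P(y_{i-1}) P(y_i)} \prod_{i=1}^{n} P(y_i) \prod_{i=1}^{n} P(x_i).
\end{aligned}$$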
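To illustrate the edge-and-node factorization, the sketch below assumes the tree has a chain $y_1 - y_2 - y_3$ with one leaf $x_i$ attached to each $y_i$, generated as a small hidden-Markov-style model with random binary parameters. This structure and all helper names are assumptions of the example, since the figure itself is not reproduced here.

```python
import itertools
import math
import random


def hmm_joint(n=3, seed=0):
    """Joint P(y_1) prod P(y_i|y_{i-1}) prod P(x_i|y_i); states are ordered
    (y_1..y_n, x_1..x_n)."""
    rng = random.Random(seed)

    def u():
        return rng.uniform(0.1, 0.9)

    def bern(p, v):
        return p if v == 1 else 1.0 - p

    p_y1 = u()
    trans = {y: u() for y in (0, 1)}  # P(y_i = 1 | y_{i-1})
    emit = {y: u() for y in (0, 1)}   # P(x_i = 1 | y_i)
    joint = {}
    for values in itertools.product((0, 1), repeat=2 * n):
        ys, xs = values[:n], values[n:]
        p = bern(p_y1, ys[0])
        for i in range(1, n):
            p *= bern(trans[ys[i - 1]], ys[i])
        for i in range(n):
            p *= bern(emit[ys[i]], xs[i])
        joint[values] = p
    return joint


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups` (tuples of variable indices)."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


n = 3
joint = hmm_joint(n)
max_gap = 0.0
for values in itertools.product((0, 1), repeat=2 * n):
    st = dict(enumerate(values))
    node_edges = math.prod(cr(joint, [(n + i,), (i,)], st) for i in range(n))
    chain_edges = math.prod(cr(joint, [(i - 1,), (i,)], st) for i in range(1, n))
    marginals = math.prod(prob(joint, {v: st[v]}) for v in range(2 * n))
    max_gap = max(max_gap, abs(node_edges * chain_edges * marginals - joint[values]))
print(max_gap < 1e-9)
```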



3.3 Example 3: Chain-Structured CRF 




CRF can be considered as a special MRF which factorizes the conditional probability. Here we show that CRF-F is a special case of CR-F:

$$P(y_1, y_2, \ldots, y_n \mid X) = CR(y_1, y_2, \ldots, y_n \mid X) \prod_{i=1}^{n} P(y_i \mid X) \tag{18}$$

$$= \prod_{i=1}^{n-1} CR(y_i, y_{i+1} \mid X) \prod_{i=1}^{n} P(y_i \mid X) \tag{19}$$

$$= \prod_{i=1}^{n-1} \frac{P(y_i, y_{i+1} \mid X)}{P(y_i \mid X) P(y_{i+1} \mid X)} \prod_{i=1}^{n} P(y_i \mid X) \tag{20}$$

We get Eqn.(18) by Eqn.(9). We obtain Eqn.(19) from Eqn.(18) because under the condition $X$, $\{y_1, \ldots, y_n\}$ are chain structured and can be partitioned as in Example 2. We can see that $CR(y_i, y_{i+1} \mid X)$ and $P(y_i \mid X)$ are just the transition feature functions and state feature functions in CRF-F (Eqn.(4)), respectively. CR-F tells us not only the scopes of the factors, but also the exact probability functions of these factors, where $\phi_i(y_i, y_{i+1}, X) = CR(y_i, y_{i+1} \mid X) = \frac{P(y_i, y_{i+1} \mid X)}{P(y_i \mid X) P(y_{i+1} \mid X)}$ and $f_i(y_i, X) = P(y_i \mid X)$. CRF-F cannot tell us the exact probability functions of the factors.

3.4 Example 4: Arbitrary Markov Network 




Figure 4: A Markov Network 

In this example, we show how to factorize an arbitrary Markov network. In particular, we demonstrate how to deal with the incomplete graph by using the condition operation (Eqn.(10)):





Figure 3: Chain-Structured CRF (Lafferty et al. 2001) 



Figure 5: The Incomplete Structures 
Factorization: 

Step1: ({A, B, C, D, E}) -> ({C}, {A, B, D, E}).

CR(A, B, C, D, E) = CR(C, AD)CR(A, B, D, E)

Now come the incomplete structures {A, B, D, E} as shown in Fig.5. Should we go on to factorize CR(A, B, D, E) using the left structure or the right structure? According to the independence semantics of the original graph, we have (C ⊥ B | AD), (A ⊥ D | BC) and (E ⊥ AD | B). We have already used (C ⊥ B | AD) at the first step. (E ⊥ AD | B) is not related to C, so it holds for both the left structure and the right structure. There are two choices for (A ⊥ D | BC). The left structure means that under the condition C, (A ⊥ D | B) holds; and the right structure means that A and D are not independent given B. Both of them are correct.

Step2 (Left/Right): {A, B, D, E} -> ({A, B, D}, {E}).

CR(A, B, D, E) = CR(A, B, D)CR(D, E)

Step3 (Left): {A, B, D | C} -> ({A, B | C}, {D | C}).
By the condition operation (Eqn.(10)):

$$\begin{aligned}
CR(A, B, D) &= \sum_{C} \left[ CR(A, B, D \mid C) \, CR(A, C) \, CR(B, C) \, CR(D, C) \, P(C) \right] \\
&= \sum_{C} \left[ CR(A, B \mid C) \, CR(B, D \mid C) \, CR(A, C) \, CR(B, C) \, CR(D, C) \, P(C) \right]
\end{aligned}$$

Step3 (Right): {A, B, D} -> {A, B, D}.

CR(A, B, D) = CR(A, B, D)

The results of Step3 (Left) and Step3 (Right) are equal regarding the independence semantics of the original graph in Fig.4. With the condition operation we can utilize all conditional independences. In this example, if we did not use the condition operation, then the conditional independence (A ⊥ D | BC) could not be used.

4 CR-F and (R)MRF-F 

Using CR-F, there can be a lot of different ways to 
factorize a graph. In this section, we show that the 
factors of (R)MRF-F can be obtained by a very special 
operation sequence of CR. Thus (R)MRF-F are just 
special cases of CR-F. 

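Suppose the nodes in $G$ are $G = \{g_1, g_2, \ldots, g_n\}$. For each $S \in \mathcal{P}(G) \setminus G$, including $\emptyset$, repeat the following two steps $2^{|G| - |S| - 1}$ times:

1. Duplicate (Eqn.(13)) the nodes in $G$:

$$CR(G) = CR(G, G) P(g_1) \cdots P(g_n)$$

2. Partition the $G$ out:

$$\begin{aligned}
CR(G) &= CR(G, G) P(g_1) \cdots P(g_n) \\
&= CR(G) \, CR(G, G) \, CR(G) \, P(g_1) \cdots P(g_n) \\
&= CR(G) \frac{P(g_1) \cdots P(g_n)}{P(g_1, \ldots, g_n)} CR(G) \\
&= \frac{CR(G)}{CR(G)} CR(G)
\end{aligned}$$

As $\frac{CR(G)}{CR(G)} = 1$, we can assign arbitrary values to $G \setminus S$, and we get:

$$CR(G) = \frac{CR(S, G \setminus S = 0)}{CR(S, G \setminus S = 0)} CR(G)$$

Then factorize the $CR(G)$ on the right side for the next $S$. And finally we get:

$$CR(G) = \left[ \prod_{S \in \{\mathcal{P}(G) - G\}} \left( \frac{CR(S, G \setminus S = 0)}{CR(S, G \setminus S = 0)} \right)^{2^{|G| - |S| - 1}} \right] CR(G) \tag{21}$$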

This equation may look trivially true, but in fact we have already obtained the factors in MRF-F by CR-F. What remains is to group these factors into proper scopes. The scopes are just all the subsets of $G$: $\mathcal{P}(G)$, including $\emptyset$ and $G$. For each scope $S \in \mathcal{P}(G)$, we select the following factors in Eqn.(21) into $S$:

$$\{ CR(W, G \setminus W = 0)^{(-1)^{|S| - |W|}}, \; W \in \mathcal{P}(S) \}.$$

We call these factors $W$ factors. The following two binomial equations guarantee that all the factors are selected into the scopes $\mathcal{P}(G)$ in this way:

$$2^{|G| - |W|} = (1 + 1)^{|G| - |W|} = C^{0}_{|G| - |W|} + C^{1}_{|G| - |W|} + \cdots + C^{|G| - |W|}_{|G| - |W|} \tag{22}$$

$$0^{|G| - |W|} = (1 - 1)^{|G| - |W|} = C^{0}_{|G| - |W|} - C^{1}_{|G| - |W|} + \cdots + (-1)^{|G| - |W|} C^{|G| - |W|}_{|G| - |W|} \tag{23}$$

The number of $W$ factors in Eqn.(21) is $2 \cdot 2^{|G| - |W| - 1} = 2^{|G| - |W|}$. Half of them are in the numerator and the other half in the denominator. $W$ factors are included once by each of $\{S, W \subseteq S\}$. Eqn.(22) tells us that the number of $\{S\}$ which contain the $W$ factor is also $2^{|G| - |W|}$, so the $W$ factors are exactly included into $\{S\}$. Eqn.(23) tells us that half of $\{S\}$ select the $W$ factors in the numerator and the other half select the $W$ factors in the denominator.
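The two binomial identities used in this counting argument are easy to confirm for small values of $m = |G| - |W|$. The function name `binomial_sums` is a hypothetical helper for this sketch.

```python
from math import comb


def binomial_sums(m: int):
    """Return the plain and the alternating sums of the binomial
    coefficients C(m, 0), ..., C(m, m)."""
    plain = sum(comb(m, k) for k in range(m + 1))
    alternating = sum((-1) ** k * comb(m, k) for k in range(m + 1))
    return plain, alternating


# m stands for |G| - |W|; the alternating sum is 0 for m >= 1 and 1 for m = 0.
results = {m: binomial_sums(m) for m in range(8)}
print(all(plain == 2 ** m and alt == 0 ** m
          for m, (plain, alt) in results.items()))
```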

We go on to prove that if a scope $S$ is not a clique, all the factors selected into $S$ cancel themselves out:

1. If $S$ is not a clique, then there must be two unconnected nodes $\{a, b\}$ in $S$.

2. Suppose $W \in \mathcal{P}(S \setminus \{a, b\})$. Thus all the subsets in $S$ can be categorized into four types: $W$, $W \cup \{a\}$, $W \cup \{b\}$ and $W \cup \{a, b\}$. And they must be in the following form in the scope $S$:

$$\phi(S) = \prod_{W} \left[ \frac{CR(W, a = 0, b = 0, X = 0) \, CR(W, a, b, X = 0)}{CR(W, a = 0, b, X = 0) \, CR(W, a, b = 0, X = 0)} \right]^{-1^{*}} \tag{24}$$

where $X = G \setminus \{W, a, b\}$. The absolute positions of these four factors are not important. We only need their relative positions to be correct, as they will cancel themselves out. So we denote the power as $-1^{*}$. As $MB(a, b) \subseteq W \cup X = G \setminus \{a, b\}$ and $W \cap X = \emptyset$, according to the UNT (Eqn.(17)), if we assign all the default values from an arbitrary but fixed global configuration, then $\phi(S) = 1$. □

Now only the factors in cliques are left. Cliques $\{c_i\}$ can be categorized into three types: $\emptyset$, $|c_i| = 1$ and $|c_i| \ge 2$. The factor in the empty clique is $CR(G = 0)$; the factors in one-node cliques are $\frac{CR(g_i, G \setminus g_i = 0)}{CR(g_i = 0, G \setminus g_i = 0)}$, where $g_i$ is the unique node in this clique; and the factors in multi-node cliques can be written in the same form as Eqn.(24), where $\{a, b\}$ can be any pair of nodes in the clique. Then

$$\begin{aligned}
P(g_1, \ldots, g_n) &= CR(g_1, \ldots, g_n) P(g_1) \cdots P(g_n) \\
&= CR(G = 0) \prod_{i=1}^{n} \frac{CR(g_i, G \setminus g_i = 0) P(g_i)}{CR(g_i = 0, G \setminus g_i = 0)} \prod_{|c_i| \ge 2} \prod_{w} \left[ \frac{CR(w, a = 0, b = 0, X = 0) \, CR(w, a, b, X = 0)}{CR(w, a = 0, b, X = 0) \, CR(w, a, b = 0, X = 0)} \right]^{-1^{*}},
\end{aligned} \tag{25}$$

where $w \in \mathcal{P}(c_i \setminus \{a, b\})$ and $X = G \setminus \{w, a, b\}$. If we substitute the CRs in Eqn.(25) with their probability definition (Eqn.(5)), we get MRF-F (Eqn.(1)) exactly. As we can obtain the factors in MRF-F by CR-F, MRF-F can be considered as a special case of CR-F.

We can further refine the scopes in Eqn.(25):

$$\begin{aligned}
\frac{CR(g_i, G \setminus g_i = 0)}{CR(g_i = 0, G \setminus g_i = 0)}
&= \frac{CR(g_i, MB(g_i) = 0, X = 0)}{CR(g_i = 0, MB(g_i) = 0, X = 0)} \\
&= \frac{CR(g_i) \, CR(g_i, MB(g_i) = 0) \, CR(MB(g_i) = 0, X = 0)}{CR(g_i = 0) \, CR(g_i = 0, MB(g_i) = 0) \, CR(MB(g_i) = 0, X = 0)} \\
&= \frac{CR(g_i, MB(g_i) = 0)}{CR(g_i = 0, MB(g_i) = 0)},
\end{aligned} \tag{26}$$

where $X = G \setminus \{g_i, MB(g_i)\}$. And also:

$$\begin{aligned}
& \frac{CR(w, a = 0, b = 0, X = 0) \, CR(w, a, b, X = 0)}{CR(w, a = 0, b, X = 0) \, CR(w, a, b = 0, X = 0)} \\
&= \frac{CR(w, a = 0, b = 0, M = 0, N = 0, H = 0)}{CR(w, a = 0, b, M = 0, N = 0, H = 0)} \cdot \frac{CR(w, a, b, M = 0, N = 0, H = 0)}{CR(w, a, b = 0, M = 0, N = 0, H = 0)} \\
&= \frac{CR(w, a = 0, b = 0, M = 0, N = 0)}{CR(w, a = 0, b, M = 0, N = 0)} \cdot \frac{CR(w, a, b, M = 0, N = 0)}{CR(w, a, b = 0, M = 0, N = 0)},
\end{aligned} \tag{27}$$

where $M = c \setminus \{w, a, b\}$, $N = MB(c)$ and $H = G \setminus \{c \cup MB(c)\}$. If we first replace the factors in Eqn.(25) using Eqn.(26) and Eqn.(27), and then replace the CRs in the new equation using Eqn.(18) with $N = 0$ as the condition, we get RMRF-F (Eqn.(3)) exactly. We can see that in the refinement steps Eqn.(26) and Eqn.(27), we just further applied the partition operation and the first CIT to the existing factors. That means we can get the factors of RMRF-F by a sequence of graph operations. Therefore RMRF-F is a special case of CR-F.

5 Factorizing TCG

In this section, we describe a systematic way to factorize TCGs into factors which are defined exactly over the maximal cliques without any default configuration. First, we review the concept of clique graph (Hamelink 1968) in graph theory.

The clique graph (CG) of a given graph $G(V, E)$ is a graph $G'(V', E')$. The nodes of $G'$ are defined as $V' = \{C_1, C_2, \ldots, C_n\}$. There exists a one-to-one mapping between $\{C_1, C_2, \ldots, C_n\}$ and all the maximal cliques $\{c_1, c_2, \ldots, c_n\}$ in $G$. The edges in $G'$ are defined as $E' = \{(C_i, C_j) : V(c_i) \cap V(c_j) \ne \emptyset, \; 1 \le i, j \le n, \; i \ne j\}$.

Here we define the tree-structured CG (TCG) by Alg.(1). Notice that, according to our definition, whether a CG is a TCG cannot be judged simply by the existence of cycles in the CG. Even if a CG contains cycles, it may still be a TCG, as the example in Fig.6 shows.

Figure 6: A graph and its TCG



Algorithm 1 isTCG

Input: G(V, E) and its CG G'(V', E')
while true do
    if |V'| ≤ 1 then
        return true;
    end if
    noChange = true;
    for i = 1 to |V'| do
        {Here adj(C_i) = {C_k, ..., C_l} are all the adjacent nodes of C_i in G' and {c_k, ..., c_l} are their corresponding maximal cliques in G.}
        if ∃C_j ∀C_h: c_i ∩ c_h ⊆ c_i ∩ c_j; C_j, C_h ∈ adj(C_i) then
            G' = G' − C_i;
            noChange = false;
            break;
        end if
    end for
    if noChange == true then
        return false;
    end if
end while
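A minimal Python sketch of Alg.(1) follows. It assumes the input is given directly as the list of maximal cliques of $G$ (each a set of node labels) and builds the clique graph itself; the function names `clique_graph` and `is_tcg` and the three test graphs are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def clique_graph(cliques):
    """Adjacency of the clique graph: two maximal cliques are adjacent when
    they share at least one node."""
    adj = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if cliques[i] & cliques[j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def is_tcg(max_cliques):
    """Repeatedly remove a clique node C_i that has an adjacent C_j whose
    overlap with c_i contains the overlap of c_i with every adjacent C_h."""
    cliques = [frozenset(c) for c in max_cliques]
    adj = clique_graph(cliques)
    alive = set(adj)
    while True:
        if len(alive) <= 1:
            return True
        removable = None
        for i in sorted(alive):
            nbrs = adj[i] & alive
            # An isolated clique node has no C_j, so it is never removable.
            if any(all(cliques[i] & cliques[h] <= cliques[i] & cliques[j]
                       for h in nbrs)
                   for j in nbrs):
                removable = i
                break
        if removable is None:
            return False
        alive.remove(removable)


chain = [{"A", "B"}, {"B", "C"}]                             # path A - B - C
star = [{"A", "B"}, {"A", "C"}, {"A", "D"}]                  # CG is a triangle
square = [{"A", "B"}, {"B", "C"}, {"C", "D"}, {"D", "A"}]    # 4-cycle
print(is_tcg(chain), is_tcg(star), is_tcg(square))
```

In the star example the clique graph contains a cycle, yet every overlap is the single node A, so the procedure still reduces it to one clique.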



TCGs can be factorized as follows:

Step0: $P(x_1, \ldots, x_{|V|}) = CR(x_1, \ldots, x_{|V|}) P(x_1) \cdots P(x_{|V|})$.

Step1: Select a node $C_i$ for which $\exists C_j \, \forall C_h: c_i \cap c_h \subseteq c_i \cap c_j$; $C_j, C_h \in adj(C_i)$. We call $C_j$ the maximum adjacent node of $C_i$, denoted as $Maxadj(C_i)$. Alg.(1) guarantees that for a TCG there always exists such a node during the factorization process. Duplicate $\{x_k, \ldots, x_l\} = V(c_i) \cap V(c_j)$:

$$CR(x_1, \ldots, x_{|V|}) = CR(x_1, \ldots, x_{|V|}, x_k, \ldots, x_l) P(x_k) \cdots P(x_l)$$

Step2: Then we partition the random variables $\{x_1, x_2, \ldots, x_{|V|}, x_k, \ldots, x_l\}$ into two parts: $\{x_p, \ldots, x_q\} = V(c_i)$ and the remainder $\{x_h, \ldots, x_m\} = \cup V(c \setminus c_i)$.

$$\begin{aligned}
CR(x_1, x_2, \ldots, x_{|V|}, x_k, \ldots, x_l)
&= CR(x_p, \ldots, x_q) \, CR(x_h, \ldots, x_m) \, CR(x_p \ldots x_q, \; x_h \ldots x_m) \\
&= CR(x_p, \ldots, x_q) \, CR(x_h, \ldots, x_m) \, CR(x_k \ldots x_l, \; x_k \ldots x_l) && (28) \\
&= \frac{CR(x_p, \ldots, x_q)}{P(x_k, \ldots, x_l)} CR(x_h, \ldots, x_m). && (29)
\end{aligned}$$

We obtain Eqn.(28) from Eqn.(16): $\{x_k, \ldots, x_l\}$ completely separate $c_i$ from the remainder of $G$, so $(x_p \ldots x_{k-1} x_{l+1} \ldots x_q \perp x_h \ldots x_{k-1} x_{l+1} \ldots x_m \mid x_k \ldots x_l)$. As in Eqn.(29) $\{x_p, \ldots, x_q\} = V(c_i)$ and $\{x_k, \ldots, x_l\} = V(c_i) \cap V(c_j)$, the scope of $\frac{CR(x_p, \ldots, x_q)}{P(x_k, \ldots, x_l)}$ is just the maximal clique $c_i$. Repeat Step1 and Step2 until only one clique is left:

$$P(x_1, x_2, \ldots, x_{|V|}) = \prod_{i=1}^{|V'| - 1} \left[ \frac{CR(V(c_i))}{P(V(c_i) \cap V(Maxadj(c_i)))} \prod_{x_j \in V(c_i)} P(x_j) \right] CR(V(c_{|V'|})) \prod_{x_j \in V(c_{|V'|})} P(x_j),$$

where $C_{|V'|}$ is the root of $G'$, which is the final clique left in Alg.(1). Therefore the probability functions over maximal cliques can be written as follows.

If $c_i$ is not the root of $G'$:

$$\phi_i(c_i) = \frac{CR(V(c_i))}{P(V(c_i) \cap V(Maxadj(c_i)))} \prod_{x_j \in V(c_i)} P(x_j) = \frac{P(V(c_i))}{P(V(c_i) \cap V(Maxadj(c_i)))}.$$

If $c_i$ is the root of $G'$:

$$\phi_i(c_i) = CR(V(c_i)) \prod_{x_j \in V(c_i)} P(x_j) = P(V(c_i)).$$
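As a small worked instance of these clique potentials, the sketch below assumes a hypothetical Markov chain A - B - C with maximal cliques {A, B} and {B, C}, taking {B, C} as the root. It checks that the non-root potential in its CR form equals P(A, B) / P(B) and that the product of the two potentials recovers the joint. The distribution and helper names are assumptions of this example.

```python
import itertools
import math
import random


def chain_joint(seed=0):
    """Joint P(A) P(B|A) P(C|B) over binary variables, ordered (A, B, C)."""
    rng = random.Random(seed)

    def u():
        return rng.uniform(0.1, 0.9)

    def bern(p, v):
        return p if v == 1 else 1.0 - p

    p_a = u()
    p_b = {a: u() for a in (0, 1)}  # P(B = 1 | A)
    p_c = {b: u() for b in (0, 1)}  # P(C = 1 | B)
    return {(a, b, c): bern(p_a, a) * bern(p_b[a], b) * bern(p_c[b], c)
            for a, b, c in itertools.product((0, 1), repeat=3)}


def prob(joint, assignment):
    """P(assignment); assignment maps variable index -> value."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def cr(joint, groups, state):
    """CR among the events in `groups` (tuples of variable indices)."""
    union = {i: state[i] for g in groups for i in g}
    denom = math.prod(prob(joint, {i: state[i] for i in g}) for g in groups)
    return prob(joint, union) / denom


A, B, C = 0, 1, 2
joint = chain_joint()
max_gap = 0.0
for values in itertools.product((0, 1), repeat=3):
    st = dict(enumerate(values))
    p_b = prob(joint, {B: st[B]})
    # Non-root clique {A, B}: CR(V(c)) / P(overlap) * product of marginals.
    phi_ab = (cr(joint, [(A,), (B,)], st) / p_b
              * prob(joint, {A: st[A]}) * p_b)
    # Root clique {B, C}: CR(V(c)) * product of marginals = P(B, C).
    phi_bc = (cr(joint, [(B,), (C,)], st)
              * p_b * prob(joint, {C: st[C]}))
    closed_form = prob(joint, {A: st[A], B: st[B]}) / p_b
    max_gap = max(max_gap,
                  abs(phi_ab - closed_form),
                  abs(phi_ab * phi_bc - joint[values]))
print(max_gap < 1e-9)
```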

6 Conclusion

In this paper, we constructed the novel mathematical concept CR upon the foundations of probability theory. CR provides a unified mathematical foundation for factorizing PGMs. We illustrated that BN-F, CRF-F, MRF-F and RMRF-F are all special cases of CR-F. The factors of CR-F can be written as exact probability functions. We described a systematic way to factorize TCGs with factor scopes exactly over maximal cliques without any default configuration, which improves the results of (R)MRF-F.

7 Discussion and Future Work

In this paper, we focused on constructing the mathematical foundation for factorizing PGMs and did not discuss learning and inference methods. Note, however, that as BN-F, CRF-F, MRF-F and RMRF-F are all special cases of CR-F, the learning and inference methods based on the results of these factorizations can also be applied to CR-F. Using CR-F, we may get factorizations that consist of far fewer factors defined on local scopes. More importantly, these factors can be written as exact probability functions. This should benefit learning and inference.

References

Abbeel, P., Koller, D., & Ng, A. (2005). Learning factor graphs in polynomial time & sample complexity. In UAI-05. Arlington, Virginia: AUAI Press, 1-9.

Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Statistical Science, 1 ed.

Cheung, S. (2008). Proof of Hammersley-Clifford Theorem. Tech. rep.

Clifford, P. (1990). Markov random fields in statistics. In Disorder in Physical Systems. 19-32.

Hamelink, R. C. (1968). A partial characterization of clique graphs. In Journal of Combinatorial Theory. 192-197.

Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. In Zeitschrift für Physik A Hadrons and Nuclei, vol. 31, 253-258.

Jordan, M. I. (1998). Learning in Graphical Models. MIT Press.

Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML-01. 282-289.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. In Artificial Intelligence, vol. 29, 241-288.



